Introduction ¶

SpamBase dataset is a classification dataset containing 4601 emails sent to HP (Hewlett-Packard) during some period of time. The SpamBase contains numeric 57 features for each email and a binary label, 1 for spam, 0 for ham(email). This is a typical binary classification problem with the added need for clever feature selection, as many of the features provided in the dataset might be useless (Hopkins et al., 1998).

Exploratory Data Analysis ¶

In this section, we use various python libraries like the pandas dataframe, numpy, matplotlib, and seaborn to do a data summary and data visualization of the spambase dataset. This will give us a better understanding of the distributions and relationships between various features and the target variable(label).

In [ ]:
import numpy as np 
import pandas as pd
#for data visualisation:
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline
In [ ]:
# read the file containing the column names
with open('spambase.names') as f:
    list_contents = f.readlines()
    colnames = []
    for item in list_contents:
        colname = item.split(':')[0]
        colnames.append(colname)
colnames.append('label')
In [ ]:
# get the length of the features(columns)
len(colnames)
Out[ ]:
58
In [ ]:
# read the file containing the dataset and assign it to a variable
dataset = pd.read_csv('spambase.data', header=None)

dataset.columns = colnames
In [ ]:
# get the first five rows of the dataset
dataset.head()
Out[ ]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... char_freq_; char_freq_( char_freq_[ char_freq_! char_freq_$ char_freq_# capital_run_length_average capital_run_length_longest capital_run_length_total label
0 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.000 0.0 0.778 0.000 0.000 3.756 61 278 1
1 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 0.94 ... 0.00 0.132 0.0 0.372 0.180 0.048 5.114 101 1028 1
2 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 0.25 ... 0.01 0.143 0.0 0.276 0.184 0.010 9.821 485 2259 1
3 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.137 0.0 0.137 0.000 0.000 3.537 40 191 1
4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.135 0.0 0.135 0.000 0.000 3.537 40 191 1

5 rows × 58 columns

In [ ]:
# get the last five rows of the dataset
dataset.tail()
Out[ ]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... char_freq_; char_freq_( char_freq_[ char_freq_! char_freq_$ char_freq_# capital_run_length_average capital_run_length_longest capital_run_length_total label
4596 0.31 0.0 0.62 0.0 0.00 0.31 0.0 0.0 0.0 0.0 ... 0.000 0.232 0.0 0.000 0.0 0.0 1.142 3 88 0
4597 0.00 0.0 0.00 0.0 0.00 0.00 0.0 0.0 0.0 0.0 ... 0.000 0.000 0.0 0.353 0.0 0.0 1.555 4 14 0
4598 0.30 0.0 0.30 0.0 0.00 0.00 0.0 0.0 0.0 0.0 ... 0.102 0.718 0.0 0.000 0.0 0.0 1.404 6 118 0
4599 0.96 0.0 0.00 0.0 0.32 0.00 0.0 0.0 0.0 0.0 ... 0.000 0.057 0.0 0.000 0.0 0.0 1.147 5 78 0
4600 0.00 0.0 0.65 0.0 0.00 0.00 0.0 0.0 0.0 0.0 ... 0.000 0.000 0.0 0.125 0.0 0.0 1.250 5 40 0

5 rows × 58 columns

In [ ]:
# check datatype for all columns and rows
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 58 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   word_freq_make              4601 non-null   float64
 1   word_freq_address           4601 non-null   float64
 2   word_freq_all               4601 non-null   float64
 3   word_freq_3d                4601 non-null   float64
 4   word_freq_our               4601 non-null   float64
 5   word_freq_over              4601 non-null   float64
 6   word_freq_remove            4601 non-null   float64
 7   word_freq_internet          4601 non-null   float64
 8   word_freq_order             4601 non-null   float64
 9   word_freq_mail              4601 non-null   float64
 10  word_freq_receive           4601 non-null   float64
 11  word_freq_will              4601 non-null   float64
 12  word_freq_people            4601 non-null   float64
 13  word_freq_report            4601 non-null   float64
 14  word_freq_addresses         4601 non-null   float64
 15  word_freq_free              4601 non-null   float64
 16  word_freq_business          4601 non-null   float64
 17  word_freq_email             4601 non-null   float64
 18  word_freq_you               4601 non-null   float64
 19  word_freq_credit            4601 non-null   float64
 20  word_freq_your              4601 non-null   float64
 21  word_freq_font              4601 non-null   float64
 22  word_freq_000               4601 non-null   float64
 23  word_freq_money             4601 non-null   float64
 24  word_freq_hp                4601 non-null   float64
 25  word_freq_hpl               4601 non-null   float64
 26  word_freq_george            4601 non-null   float64
 27  word_freq_650               4601 non-null   float64
 28  word_freq_lab               4601 non-null   float64
 29  word_freq_labs              4601 non-null   float64
 30  word_freq_telnet            4601 non-null   float64
 31  word_freq_857               4601 non-null   float64
 32  word_freq_data              4601 non-null   float64
 33  word_freq_415               4601 non-null   float64
 34  word_freq_85                4601 non-null   float64
 35  word_freq_technology        4601 non-null   float64
 36  word_freq_1999              4601 non-null   float64
 37  word_freq_parts             4601 non-null   float64
 38  word_freq_pm                4601 non-null   float64
 39  word_freq_direct            4601 non-null   float64
 40  word_freq_cs                4601 non-null   float64
 41  word_freq_meeting           4601 non-null   float64
 42  word_freq_original          4601 non-null   float64
 43  word_freq_project           4601 non-null   float64
 44  word_freq_re                4601 non-null   float64
 45  word_freq_edu               4601 non-null   float64
 46  word_freq_table             4601 non-null   float64
 47  word_freq_conference        4601 non-null   float64
 48  char_freq_;                 4601 non-null   float64
 49  char_freq_(                 4601 non-null   float64
 50  char_freq_[                 4601 non-null   float64
 51  char_freq_!                 4601 non-null   float64
 52  char_freq_$                 4601 non-null   float64
 53  char_freq_#                 4601 non-null   float64
 54  capital_run_length_average  4601 non-null   float64
 55  capital_run_length_longest  4601 non-null   int64  
 56  capital_run_length_total    4601 non-null   int64  
 57  label                       4601 non-null   int64  
dtypes: float64(55), int64(3)
memory usage: 2.0 MB
In [ ]:
#This gives us all the statistical summary for each column.
dataset.describe()
Out[ ]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... char_freq_; char_freq_( char_freq_[ char_freq_! char_freq_$ char_freq_# capital_run_length_average capital_run_length_longest capital_run_length_total label
count 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 ... 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000
mean 0.104553 0.213015 0.280656 0.065425 0.312223 0.095901 0.114208 0.105295 0.090067 0.239413 ... 0.038575 0.139030 0.016976 0.269071 0.075811 0.044238 5.191515 52.172789 283.289285 0.394045
std 0.305358 1.290575 0.504143 1.395151 0.672513 0.273824 0.391441 0.401071 0.278616 0.644755 ... 0.243471 0.270355 0.109394 0.815672 0.245882 0.429342 31.729449 194.891310 606.347851 0.488698
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.588000 6.000000 35.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.065000 0.000000 0.000000 0.000000 0.000000 2.276000 15.000000 95.000000 0.000000
75% 0.000000 0.000000 0.420000 0.000000 0.380000 0.000000 0.000000 0.000000 0.000000 0.160000 ... 0.000000 0.188000 0.000000 0.315000 0.052000 0.000000 3.706000 43.000000 266.000000 1.000000
max 4.540000 14.280000 5.100000 42.810000 10.000000 5.880000 7.270000 11.110000 5.260000 18.180000 ... 4.385000 9.752000 4.081000 32.478000 6.003000 19.829000 1102.500000 9989.000000 15841.000000 1.000000

8 rows × 58 columns

In [ ]:
# Get the shape of the dataset. the first element represents the number of rows and the second element represents the columns
np.shape(dataset)
Out[ ]:
(4601, 58)
In [ ]:
#check whether there are any missing or null values
dataset.isnull().sum()#This will give number of NaN values in every column.
Out[ ]:
word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_freq_857                 0
word_freq_data                0
word_freq_415                 0
word_freq_85                  0
word_freq_technology          0
word_freq_1999                0
word_freq_parts               0
word_freq_pm                  0
word_freq_direct              0
word_freq_cs                  0
word_freq_meeting             0
word_freq_original            0
word_freq_project             0
word_freq_re                  0
word_freq_edu                 0
word_freq_table               0
word_freq_conference          0
char_freq_;                   0
char_freq_(                   0
char_freq_[                   0
char_freq_!                   0
char_freq_$                   0
char_freq_#                   0
capital_run_length_average    0
capital_run_length_longest    0
capital_run_length_total      0
label                         0
dtype: int64
In [ ]:
#If needed, check NaN values in every column using the code: by default axis=0
dataset.isnull().sum(axis = 1)
Out[ ]:
0       0
1       0
2       0
3       0
4       0
       ..
4596    0
4597    0
4598    0
4599    0
4600    0
Length: 4601, dtype: int64
In [ ]:
# check if there's any duplicate row or column in the dataset 
dataset.duplicated()
Out[ ]:
0       False
1       False
2       False
3       False
4       False
        ...  
4596    False
4597    False
4598    False
4599    False
4600    False
Length: 4601, dtype: bool
In [ ]:
# plot a histogram of the dataset.
hist_of_dataset = dataset.hist(figsize = (30,20))
hist_of_dataset
Out[ ]:
array([[<AxesSubplot: title={'center': 'word_freq_make'}>,
        <AxesSubplot: title={'center': 'word_freq_address'}>,
        <AxesSubplot: title={'center': 'word_freq_all'}>,
        <AxesSubplot: title={'center': 'word_freq_3d'}>,
        <AxesSubplot: title={'center': 'word_freq_our'}>,
        <AxesSubplot: title={'center': 'word_freq_over'}>,
        <AxesSubplot: title={'center': 'word_freq_remove'}>,
        <AxesSubplot: title={'center': 'word_freq_internet'}>],
       [<AxesSubplot: title={'center': 'word_freq_order'}>,
        <AxesSubplot: title={'center': 'word_freq_mail'}>,
        <AxesSubplot: title={'center': 'word_freq_receive'}>,
        <AxesSubplot: title={'center': 'word_freq_will'}>,
        <AxesSubplot: title={'center': 'word_freq_people'}>,
        <AxesSubplot: title={'center': 'word_freq_report'}>,
        <AxesSubplot: title={'center': 'word_freq_addresses'}>,
        <AxesSubplot: title={'center': 'word_freq_free'}>],
       [<AxesSubplot: title={'center': 'word_freq_business'}>,
        <AxesSubplot: title={'center': 'word_freq_email'}>,
        <AxesSubplot: title={'center': 'word_freq_you'}>,
        <AxesSubplot: title={'center': 'word_freq_credit'}>,
        <AxesSubplot: title={'center': 'word_freq_your'}>,
        <AxesSubplot: title={'center': 'word_freq_font'}>,
        <AxesSubplot: title={'center': 'word_freq_000'}>,
        <AxesSubplot: title={'center': 'word_freq_money'}>],
       [<AxesSubplot: title={'center': 'word_freq_hp'}>,
        <AxesSubplot: title={'center': 'word_freq_hpl'}>,
        <AxesSubplot: title={'center': 'word_freq_george'}>,
        <AxesSubplot: title={'center': 'word_freq_650'}>,
        <AxesSubplot: title={'center': 'word_freq_lab'}>,
        <AxesSubplot: title={'center': 'word_freq_labs'}>,
        <AxesSubplot: title={'center': 'word_freq_telnet'}>,
        <AxesSubplot: title={'center': 'word_freq_857'}>],
       [<AxesSubplot: title={'center': 'word_freq_data'}>,
        <AxesSubplot: title={'center': 'word_freq_415'}>,
        <AxesSubplot: title={'center': 'word_freq_85'}>,
        <AxesSubplot: title={'center': 'word_freq_technology'}>,
        <AxesSubplot: title={'center': 'word_freq_1999'}>,
        <AxesSubplot: title={'center': 'word_freq_parts'}>,
        <AxesSubplot: title={'center': 'word_freq_pm'}>,
        <AxesSubplot: title={'center': 'word_freq_direct'}>],
       [<AxesSubplot: title={'center': 'word_freq_cs'}>,
        <AxesSubplot: title={'center': 'word_freq_meeting'}>,
        <AxesSubplot: title={'center': 'word_freq_original'}>,
        <AxesSubplot: title={'center': 'word_freq_project'}>,
        <AxesSubplot: title={'center': 'word_freq_re'}>,
        <AxesSubplot: title={'center': 'word_freq_edu'}>,
        <AxesSubplot: title={'center': 'word_freq_table'}>,
        <AxesSubplot: title={'center': 'word_freq_conference'}>],
       [<AxesSubplot: title={'center': 'char_freq_;'}>,
        <AxesSubplot: title={'center': 'char_freq_('}>,
        <AxesSubplot: title={'center': 'char_freq_['}>,
        <AxesSubplot: title={'center': 'char_freq_!'}>,
        <AxesSubplot: title={'center': 'char_freq_$'}>,
        <AxesSubplot: title={'center': 'char_freq_#'}>,
        <AxesSubplot: title={'center': 'capital_run_length_average'}>,
        <AxesSubplot: title={'center': 'capital_run_length_longest'}>],
       [<AxesSubplot: title={'center': 'capital_run_length_total'}>,
        <AxesSubplot: title={'center': 'label'}>, <AxesSubplot: >,
        <AxesSubplot: >, <AxesSubplot: >, <AxesSubplot: >,
        <AxesSubplot: >, <AxesSubplot: >]], dtype=object)
In [ ]:
# visualize if there's any missing value using a barchart. A blue bar represents the number of missing values.
# in this dataset we have no missing value.
sns.set(rc={'figure.figsize':(17,7)})
miss_vals = pd.DataFrame(dataset.isnull().sum() / len(dataset) * 100)
miss_vals.plot(kind='bar',title='Missing values in percentage',ylabel='percentage')
Out[ ]:
<AxesSubplot: title={'center': 'Missing values in percentage'}, ylabel='percentage'>

The Outcome ¶

We've successfully analysed the spambase dataset using various exploratory tools like graphs and statistical analysis so as to enable us do an informed data preprocessing.

Data Preprocessing ¶

In this section, we clean the dataset by filling the missing values(if any) with the mean of its column, we split our dataset into training and testing. We need to perform Feature Scaling when we are dealing with Gradient Descent Based algorithms. Scaling has no significant effect on tree based algorithms, so we would only scale it during the Neural network section. this is the section we also encode the target variable if it's not in binary format.

In [ ]:
# fill and missing data with te mean of its column.
dataset.fillna(dataset.mean(), inplace=True)
In [ ]:
from sklearn.model_selection import train_test_split
In [ ]:
# seperate the features from the target variable.
X = dataset.drop('label', axis=1)
y = dataset['label']
In [ ]:
# split the dataset into training set of 75% and test sets of 25%. We also set the stratify hyperparameter for equal distribution.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
In [ ]:
# get the shape of the split data
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[ ]:
((3450, 57), (1151, 57), (3450,), (1151,))

The outcome ¶

We've been able to fill missing data, and split our dataset into training and testing. scaling will be applied only to the neural network model.

. . .

Feature Engineering and Feature Selection ¶

Here we apply feature engineering and feature selection to get the relevant features. We achieve this by using sklearn variance threshold, pearson correlation coefficient, and DecisionTree Feature_importance_.

In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import VarianceThreshold

Using scikit-learn Variance threshold to remove features with variance 0

In [ ]:
# check for variance within each feature and remove the features with variance=0
var_thres = VarianceThreshold(threshold=0.0) # set the threshold to 0
var_thres.fit(X)
Out[ ]:
VarianceThreshold()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
VarianceThreshold()
In [ ]:
# get the sum of all the features with a variance above 0 
sum(var_thres.get_support())
Out[ ]:
57
In [ ]:
# we check how many features have a variance of 0
constant_columns = [column for column in X_train.columns
                    if column not in X_train.columns[var_thres.get_support()]]

print(len(constant_columns))
0
In [ ]:
# drop features with variance of 0. we apply it to only X_train not the whole dataset.
X_train.drop(constant_columns,axis=1)
Out[ ]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... word_freq_conference char_freq_; char_freq_( char_freq_[ char_freq_! char_freq_$ char_freq_# capital_run_length_average capital_run_length_longest capital_run_length_total
4140 0.00 0.00 1.58 0.0 0.00 0.0 0.00 0.00 0.00 0.00 ... 0.0 0.000 0.000 0.000 0.000 0.000 0.000 1.230 4 16
918 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 ... 0.0 0.218 0.087 0.000 0.174 0.174 0.437 9.186 126 937
1250 0.00 0.00 0.84 0.0 0.84 0.0 0.84 0.00 0.00 0.00 ... 0.0 0.000 0.388 0.000 0.776 0.129 0.000 10.375 168 249
845 0.59 0.00 0.00 0.0 0.00 0.0 1.18 0.59 0.59 1.18 ... 0.0 0.000 0.000 0.000 0.421 0.000 0.000 6.275 46 182
199 0.51 0.51 0.00 0.0 0.00 0.0 0.51 0.00 0.00 0.51 ... 0.0 0.000 0.135 0.000 0.067 0.000 0.000 2.676 17 91
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3581 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 ... 0.0 0.000 0.000 0.000 0.000 0.000 0.000 1.384 4 18
4558 0.16 0.00 0.32 0.0 0.10 0.1 0.00 0.00 0.00 0.00 ... 0.0 0.025 0.017 0.008 0.000 0.008 0.008 1.318 12 244
3172 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 ... 0.0 0.000 0.000 0.000 0.000 0.000 0.000 2.428 5 17
4257 1.47 1.47 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.47 ... 0.0 0.000 0.000 0.000 0.000 0.000 0.000 2.391 21 55
132 0.00 0.00 1.12 0.0 0.56 0.0 0.00 0.00 0.00 0.56 ... 0.0 0.000 0.101 0.000 0.606 0.000 0.000 2.360 19 144

3450 rows × 57 columns

Checking for correlation between features after applying the scikit learn variance threshold

In [ ]:
#we check with the X_train not the whole features.
cor = X_train.corr()

# we use the seaborn heatmap to visualise the correlations.
cmap = sns.cm.rocket_r #for reversed color, the darker the more correlated.

plt.figure(figsize=(50, 50)) # set the size of the plot.

# initialise the seaborn heatmap.
ax = sns.heatmap(cor, linewidths=.3, annot=True, fmt=".2", cmap=cmap)#show numbers on the cells: annot=True

# for avoiding reseting labels
ax.tick_params(axis='x', labelrotation=45)

Using Pearson Correlation Coefficient to remove features with high correlations between them.

In [ ]:
# code partially gotten from https://www.youtube.com/watch?v=FndwYNcVe0U&list=PLZoTAELRMXVPgjwJ8VyRoqmfNs2CJwhVH&index=3

# Checking for correlation using pearson correlation coefficient
def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)-1):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff values.
                #We compare the feature correlation with target and drop the ones with lower coeff.
                if abs(corr_matrix.iloc[j, len(corr_matrix.columns)-1]) > abs(corr_matrix.iloc[i, len(corr_matrix.columns)-1]):
                    colname = corr_matrix.columns[i]
                else:
                    colname = corr_matrix.columns[j]
                col_corr.add(colname)
    return col_corr
In [ ]:
# Apply the pearson correlation to our dataset with a threshold of 90%.
corr_features = correlation(dataset, 0.90)

print(corr_features)

print('number of correlated features: '+str(len(set(corr_features))))
{'word_freq_415'}
number of correlated features: 1
In [ ]:
# drop the correlated features. the features that are related to target are kept.
dataset_feature_reduce = dataset.drop(corr_features, axis=1)
In [ ]:
# get the shape of the reduced dataset
dataset_feature_reduce.shape
Out[ ]:
(4601, 57)
In [ ]:
# get a new X and y from the reduced dataset
X_train, X_test, y_train, y_test = train_test_split(dataset_feature_reduce.drop('label', axis=1), dataset_feature_reduce['label'],
                                                    stratify=dataset_feature_reduce['label'], random_state=42)
In [ ]:
len(X_train.columns)
Out[ ]:
56
In [ ]:
# visualise it with seaborn heatmap
#reverse the color scheme: the darker the more positive related.
cmap = sns.cm.rocket_r 

plt.figure(figsize=(50, 50)) # set the size of the heatmap

#https://stackoverflow.com/questions/39409866/correlation-heatmap
view = sns.heatmap(dataset_feature_reduce.corr(), linewidths=.3, annot=True, fmt=".2", cmap=cmap)#show numbers on the cells: annot=True

# To avoid resetting labels
view.tick_params(axis='x', labelrotation=45) # tilt the x-label by 45 degree.

Let's compare the evaluation results before and after reducing features

In [ ]:
#before feature reduction
from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier(n_estimators=100)#todo: tune hyper param
scores_rf_featRed = cross_val_score(model_rf, dataset.drop('label', axis=1), dataset_feature_reduce['label'], scoring = "f1_weighted", cv=5)
print(scores_rf_featRed)
scores_rf_featRed.mean()
[0.94876861 0.94077062 0.95759847 0.97274908 0.82143156]
Out[ ]:
0.9282636688282416
In [ ]:
#after feature reduction
model_rf = RandomForestClassifier(n_estimators=100)
scores_rf_featRed = cross_val_score(model_rf, dataset_feature_reduce.drop('label', axis=1), dataset_feature_reduce['label'], scoring = "f1_weighted", cv=5)
print(scores_rf_featRed)
scores_rf_featRed.mean()
[0.9476932  0.94188636 0.95867561 0.97166672 0.82360847]
Out[ ]:
0.9287060722020681

Applying Adaboost before and after feature reduction

In [ ]:
#before feature reduction
# use the AdaBoost classifier with the default base classifier - DecisionTreeClassifier(max_depth=1) 
from sklearn.ensemble import AdaBoostClassifier
model_ada = AdaBoostClassifier(n_estimators=100)#todo: tune hyper param
scores_ada_feature_reduce = cross_val_score(model_ada, dataset.drop('label', axis=1), dataset_feature_reduce['label'], scoring = "f1_weighted", cv=5)
print(scores_ada_feature_reduce)
scores_ada_feature_reduce.mean()#best result: 0.9423
[0.94226055 0.94539341 0.9489726  0.95748857 0.82656792]
Out[ ]:
0.924136609628035
In [ ]:
#after feature reduction
model_ada = AdaBoostClassifier(n_estimators=100)
scores_ada_feature_reduce = cross_val_score(model_ada, dataset_feature_reduce.drop('label', axis=1), dataset_feature_reduce['label'], scoring = "f1_weighted", cv=5)
print(scores_ada_feature_reduce)
scores_ada_feature_reduce.mean()
[0.94226055 0.94539341 0.9489726  0.95748857 0.82656792]
Out[ ]:
0.924136609628035

The Outcome¶

There's no significant difference before and after dropping the feature because there's only one feature dropped and it was insignifiant to the target variable.

Sorting the features using the feature_importance_ and removing the features with importance <= 0 in relation to the target variable.

In [ ]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42, max_depth=8,class_weight='balanced') 

model.fit(X_train,y_train)

# Get feature importances
importances = model.feature_importances_
In [ ]:
# View the important feature on a barplot
feat_importances = pd.DataFrame(importances, index=X_train.columns, columns=["Importance"])
feat_importances.sort_values(by='Importance', ascending=False, inplace=True)
feat_importances.plot(kind='bar', figsize=(16,7))
Out[ ]:
<AxesSubplot: >
In [ ]:
# Append the features greater than 0 to a new array
threshold = 0.00000
important_features = [feature for feature, importance in zip(X_train.columns, importances) if importance > threshold]

important_features
Out[ ]:
['word_freq_address',
 'word_freq_all',
 'word_freq_our',
 'word_freq_over',
 'word_freq_remove',
 'word_freq_internet',
 'word_freq_order',
 'word_freq_mail',
 'word_freq_receive',
 'word_freq_addresses',
 'word_freq_free',
 'word_freq_business',
 'word_freq_email',
 'word_freq_font',
 'word_freq_000',
 'word_freq_money',
 'word_freq_hp',
 'word_freq_hpl',
 'word_freq_george',
 'word_freq_650',
 'word_freq_labs',
 'word_freq_telnet',
 'word_freq_data',
 'word_freq_technology',
 'word_freq_1999',
 'word_freq_direct',
 'word_freq_meeting',
 'word_freq_re',
 'word_freq_edu',
 'char_freq_;',
 'char_freq_(',
 'char_freq_!',
 'char_freq_$',
 'char_freq_#',
 'capital_run_length_average',
 'capital_run_length_longest',
 'capital_run_length_total']
In [ ]:
# get the length of the new important features
len(important_features)
Out[ ]:
37
In [ ]:
# create a new dataset with only the important features
import_feat = dataset_feature_reduce[important_features]
import_feat.head()
Out[ ]:
word_freq_address word_freq_all word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail word_freq_receive word_freq_addresses ... word_freq_re word_freq_edu char_freq_; char_freq_( char_freq_! char_freq_$ char_freq_# capital_run_length_average capital_run_length_longest capital_run_length_total
0 0.64 0.64 0.32 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.000 0.778 0.000 0.000 3.756 61 278
1 0.28 0.50 0.14 0.28 0.21 0.07 0.00 0.94 0.21 0.14 ... 0.00 0.00 0.00 0.132 0.372 0.180 0.048 5.114 101 1028
2 0.00 0.71 1.23 0.19 0.19 0.12 0.64 0.25 0.38 1.75 ... 0.06 0.06 0.01 0.143 0.276 0.184 0.010 9.821 485 2259
3 0.00 0.00 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.00 ... 0.00 0.00 0.00 0.137 0.137 0.000 0.000 3.537 40 191
4 0.00 0.00 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.00 ... 0.00 0.00 0.00 0.135 0.135 0.000 0.000 3.537 40 191

5 rows × 37 columns

In [ ]:
cmap = sns.cm.rocket_r #reverse the color scheme: the darker the more positive related
plt.figure(figsize=(60, 50))

# Plot a heatmap of the important features
heat = sns.heatmap(import_feat.corr(), linewidths=.3, annot=True, fmt=".2", cmap=cmap)

heat.tick_params(axis='x', labelrotation=45)

The outcome ¶

In this example, we've applied sklearn variance threshold, pearson corr. coef. and decision tree feature_importance_ to extract the features that are relevant in predicting the outcome of an email.

Select, Train, Apply ML models ¶

In this section, we'll explore, compare and optimise various classification models,ensemble models and ANN model.

Using Support Vector Machine on the reduced spambase dataset

In [ ]:
from sklearn import svm
from sklearn.metrics import accuracy_score # to check the accuracy of the model

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(import_feat, y, stratify=y, test_size=0.25, random_state=42)

# Train the SVM model
clf = svm.SVC(kernel='linear') #Linear Kernel is used when the data is Linearly separable 
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(acc * 100))
Accuracy: 92.96%

Using Decision tree classifier

In [ ]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(max_depth=5, random_state=42)

dtc.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dtc.predict(X_test)

# Evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(acc * 100))
Accuracy: 90.53%

Using SGClassifier

In [ ]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(loss = 'modified_huber') # "modified_humber" brings tolerance to outliers as well as probability estimates.

sgd_clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dtc.predict(X_test)

# Evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(acc * 100))
Accuracy: 90.53%

Using Random Forest to classify the Spambase Dataset

In [ ]:
from sklearn.ensemble import RandomForestClassifier

# Train the random forest model
clf = RandomForestClassifier(n_estimators=100,max_depth=5, random_state=42)

clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model's performance
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(acc * 100))
Accuracy: 92.27%

Using Ensemble learning techniques boosting(Adaboost Classifier).

In [ ]:
from sklearn.ensemble import AdaBoostClassifier

# Train the AdaBoost model
ada_clf = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=42) #Weight applied to each classifier at each boosting iteration

ada_clf.fit(X_train, y_train)


# Evaluate the model's performance
accuracy = ada_clf.score(X_test, y_test)

print("Accuracy: {:.2f}%".format(acc * 100))
Accuracy: 92.27%

Using Ensemble learning Bagging Technique

In [ ]:
from sklearn.ensemble import BaggingClassifier

# Define the base estimator
base_estimator = DecisionTreeClassifier(max_depth=10, random_state=42)

# Train the Bagging model
bag_clf = BaggingClassifier(estimator=base_estimator, n_estimators=100, random_state=42)

bag_clf.fit(X_train, y_train)

# Evaluate the model on the test set
accuracy = bag_clf.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
Accuracy: 92.96%

In this section, we use Artificial Neural Network(ANN) to classify the Spambase Dataset.

In [ ]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import MinMaxScaler
In [ ]:
sc = MinMaxScaler() # define the scaler
df_scaled = pd.DataFrame(sc.fit_transform(import_feat)) # fit & transform the data
print(df_scaled.head())
         0         1      2         3         4         5         6   \
0  0.044818  0.125490  0.032  0.000000  0.000000  0.000000  0.000000   
1  0.019608  0.098039  0.014  0.047619  0.028886  0.006301  0.000000   
2  0.000000  0.139216  0.123  0.032313  0.026135  0.010801  0.121673   
3  0.000000  0.000000  0.063  0.000000  0.042641  0.056706  0.058935   
4  0.000000  0.000000  0.063  0.000000  0.042641  0.056706  0.058935   

         7         8         9   ...        27        28        29        30  \
0  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000  0.000000   
1  0.051705  0.080460  0.031746  ...  0.000000  0.000000  0.000000  0.013536   
2  0.013751  0.145594  0.396825  ...  0.002801  0.002721  0.002281  0.014664   
3  0.034653  0.118774  0.000000  ...  0.000000  0.000000  0.000000  0.014048   
4  0.034653  0.118774  0.000000  ...  0.000000  0.000000  0.000000  0.013843   

         31        32        33        34        35        36  
0  0.023955  0.000000  0.000000  0.002502  0.006007  0.017487  
1  0.011454  0.029985  0.002421  0.003735  0.010012  0.064836  
2  0.008498  0.030651  0.000504  0.008008  0.048458  0.142551  
3  0.004218  0.000000  0.000000  0.002303  0.003905  0.011995  
4  0.004157  0.000000  0.000000  0.002303  0.003905  0.011995  

[5 rows x 37 columns]
In [ ]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df_scaled, y, stratify=y, test_size=0.25, random_state=42)
In [ ]:
# initialize the neural network
neu_net = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, alpha=1e-4,
                        solver='sgd', verbose=10, tol=1e-4, random_state=1, # solver specifies the algorithm for weight optimization over the nodes.
                        learning_rate_init=.1) 

# train the neural network
neu_net.fit(X_train, y_train)

# evaluate the model
y_pred = neu_net.predict(X_test)

print(classification_report(y_test, y_pred))
Iteration 1, loss = 0.69343255
Iteration 2, loss = 0.63775589
Iteration 3, loss = 0.59498817
Iteration 4, loss = 0.52785481
Iteration 5, loss = 0.45840216
Iteration 6, loss = 0.40233166
Iteration 7, loss = 0.36669967
Iteration 8, loss = 0.34060970
Iteration 9, loss = 0.32209685
Iteration 10, loss = 0.30958613
Iteration 11, loss = 0.30187719
Iteration 12, loss = 0.29017174
Iteration 13, loss = 0.28105919
Iteration 14, loss = 0.27950280
Iteration 15, loss = 0.26965102
Iteration 16, loss = 0.26550145
Iteration 17, loss = 0.26146178
Iteration 18, loss = 0.25868865
Iteration 19, loss = 0.25590298
Iteration 20, loss = 0.25229036
Iteration 21, loss = 0.24932949
Iteration 22, loss = 0.24553931
Iteration 23, loss = 0.24198683
Iteration 24, loss = 0.24024006
Iteration 25, loss = 0.23902999
Iteration 26, loss = 0.23652663
Iteration 27, loss = 0.23735150
Iteration 28, loss = 0.23383407
Iteration 29, loss = 0.23335405
Iteration 30, loss = 0.22848290
Iteration 31, loss = 0.22945971
Iteration 32, loss = 0.22670369
Iteration 33, loss = 0.22988831
Iteration 34, loss = 0.22625644
Iteration 35, loss = 0.22593809
Iteration 36, loss = 0.22222303
Iteration 37, loss = 0.22159780
Iteration 38, loss = 0.22322377
Iteration 39, loss = 0.22084836
Iteration 40, loss = 0.22004358
Iteration 41, loss = 0.21623610
Iteration 42, loss = 0.21601193
Iteration 43, loss = 0.21839138
Iteration 44, loss = 0.21665415
Iteration 45, loss = 0.21471715
Iteration 46, loss = 0.21303147
Iteration 47, loss = 0.21183811
Iteration 48, loss = 0.21213618
Iteration 49, loss = 0.21060671
Iteration 50, loss = 0.21010015
Iteration 51, loss = 0.21087013
Iteration 52, loss = 0.21115919
Iteration 53, loss = 0.20976592
Iteration 54, loss = 0.20881524
Iteration 55, loss = 0.20815008
Iteration 56, loss = 0.20475861
Iteration 57, loss = 0.20538371
Iteration 58, loss = 0.20456340
Iteration 59, loss = 0.20697367
Iteration 60, loss = 0.20763801
Iteration 61, loss = 0.20431188
Iteration 62, loss = 0.20307711
Iteration 63, loss = 0.20353348
Iteration 64, loss = 0.20186221
Iteration 65, loss = 0.20263261
Iteration 66, loss = 0.20268452
Iteration 67, loss = 0.20410869
Iteration 68, loss = 0.19851281
Iteration 69, loss = 0.20041458
Iteration 70, loss = 0.19893609
Iteration 71, loss = 0.19856752
Iteration 72, loss = 0.19520744
Iteration 73, loss = 0.19711629
Iteration 74, loss = 0.20224546
Iteration 75, loss = 0.19738338
Iteration 76, loss = 0.19452250
Iteration 77, loss = 0.19618929
Iteration 78, loss = 0.19542535
Iteration 79, loss = 0.19477962
Iteration 80, loss = 0.19301399
Iteration 81, loss = 0.19336297
Iteration 82, loss = 0.19241697
Iteration 83, loss = 0.19238705
Iteration 84, loss = 0.19896561
Iteration 85, loss = 0.19384432
Iteration 86, loss = 0.19112966
Iteration 87, loss = 0.19233425
Iteration 88, loss = 0.19251401
Iteration 89, loss = 0.19105247
Iteration 90, loss = 0.19009781
Iteration 91, loss = 0.18778175
Iteration 92, loss = 0.18922685
Iteration 93, loss = 0.18841970
Iteration 94, loss = 0.19046763
Iteration 95, loss = 0.18784575
Iteration 96, loss = 0.18821714
Iteration 97, loss = 0.18545463
Iteration 98, loss = 0.18877240
Iteration 99, loss = 0.18624763
Iteration 100, loss = 0.18780378
Iteration 101, loss = 0.18318539
Iteration 102, loss = 0.18432698
Iteration 103, loss = 0.18430055
Iteration 104, loss = 0.18373412
Iteration 105, loss = 0.18163919
Iteration 106, loss = 0.18406481
Iteration 107, loss = 0.18180152
Iteration 108, loss = 0.18445303
Iteration 109, loss = 0.18268441
Iteration 110, loss = 0.18664655
Iteration 111, loss = 0.18374929
Iteration 112, loss = 0.18563278
Iteration 113, loss = 0.18450609
Iteration 114, loss = 0.18402619
Iteration 115, loss = 0.18236231
Iteration 116, loss = 0.18302651
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
              precision    recall  f1-score   support

           0       0.95      0.92      0.94       697
           1       0.88      0.93      0.91       454

    accuracy                           0.92      1151
   macro avg       0.92      0.93      0.92      1151
weighted avg       0.93      0.92      0.92      1151

The outcome ¶

We were able to train and test our dataset using various classifiers and ensemble techniques, and each one performed well, some better than others. Now we would evaluate them to know which one performed better.

Evaluation ¶

Evaluate the models.

In [ ]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Define the models to evaluate
models = {"Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
          "Decision Tree":DecisionTreeClassifier(max_depth=5, random_state=42),
          "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
          "SVM":svm.SVC(kernel='linear'),
          "SGD":SGDClassifier(loss = 'modified_huber'),
          "Bagging": BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=5, random_state=42), n_estimators=100, random_state=42),
          "Neural Network": MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, alpha=1e-4,
                        solver='sgd', verbose=10, tol=1e-4, random_state=1, learning_rate_init=.1)
         }

# Evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred) * 100
    precision = precision_score(y_test, y_pred) * 100
    recall = recall_score(y_test, y_pred) * 100
    f1 = f1_score(y_test, y_pred) * 100
    print(f"{name}:\n\tAccuracy: {accuracy:.2f}\n\tPrecision: {precision:.2f}\n\tRecall: {recall:.2f}\n\tF1-Score: {f1:.2f}")
Random Forest:
	Accuracy: 94.53
	Precision: 94.74
	Recall: 91.19
	F1-Score: 92.93
Decision Tree:
	Accuracy: 90.53
	Precision: 89.47
	Recall: 86.12
	F1-Score: 87.77
AdaBoost:
	Accuracy: 94.61
	Precision: 93.36
	Recall: 92.95
	F1-Score: 93.16
SVM:
	Accuracy: 89.05
	Precision: 93.39
	Recall: 77.75
	F1-Score: 84.86
SGD:
	Accuracy: 91.14
	Precision: 86.67
	Recall: 91.63
	F1-Score: 89.08
Bagging:
	Accuracy: 91.14
	Precision: 93.35
	Recall: 83.48
	F1-Score: 88.14
Iteration 1, loss = 0.69343255
Iteration 2, loss = 0.63775589
Iteration 3, loss = 0.59498817
Iteration 4, loss = 0.52785481
Iteration 5, loss = 0.45840216
Iteration 6, loss = 0.40233166
Iteration 7, loss = 0.36669967
Iteration 8, loss = 0.34060970
Iteration 9, loss = 0.32209685
Iteration 10, loss = 0.30958613
Iteration 11, loss = 0.30187719
Iteration 12, loss = 0.29017174
Iteration 13, loss = 0.28105919
Iteration 14, loss = 0.27950280
Iteration 15, loss = 0.26965102
Iteration 16, loss = 0.26550145
Iteration 17, loss = 0.26146178
Iteration 18, loss = 0.25868865
Iteration 19, loss = 0.25590298
Iteration 20, loss = 0.25229036
Iteration 21, loss = 0.24932949
Iteration 22, loss = 0.24553931
Iteration 23, loss = 0.24198683
Iteration 24, loss = 0.24024006
Iteration 25, loss = 0.23902999
Iteration 26, loss = 0.23652663
Iteration 27, loss = 0.23735150
Iteration 28, loss = 0.23383407
Iteration 29, loss = 0.23335405
Iteration 30, loss = 0.22848290
Iteration 31, loss = 0.22945971
Iteration 32, loss = 0.22670369
Iteration 33, loss = 0.22988831
Iteration 34, loss = 0.22625644
Iteration 35, loss = 0.22593809
Iteration 36, loss = 0.22222303
Iteration 37, loss = 0.22159780
Iteration 38, loss = 0.22322377
Iteration 39, loss = 0.22084836
Iteration 40, loss = 0.22004358
Iteration 41, loss = 0.21623610
Iteration 42, loss = 0.21601193
Iteration 43, loss = 0.21839138
Iteration 44, loss = 0.21665415
Iteration 45, loss = 0.21471715
Iteration 46, loss = 0.21303147
Iteration 47, loss = 0.21183811
Iteration 48, loss = 0.21213618
Iteration 49, loss = 0.21060671
Iteration 50, loss = 0.21010015
Iteration 51, loss = 0.21087013
Iteration 52, loss = 0.21115919
Iteration 53, loss = 0.20976592
Iteration 54, loss = 0.20881524
Iteration 55, loss = 0.20815008
Iteration 56, loss = 0.20475861
Iteration 57, loss = 0.20538371
Iteration 58, loss = 0.20456340
Iteration 59, loss = 0.20697367
Iteration 60, loss = 0.20763801
Iteration 61, loss = 0.20431188
Iteration 62, loss = 0.20307711
Iteration 63, loss = 0.20353348
Iteration 64, loss = 0.20186221
Iteration 65, loss = 0.20263261
Iteration 66, loss = 0.20268452
Iteration 67, loss = 0.20410869
Iteration 68, loss = 0.19851281
Iteration 69, loss = 0.20041458
Iteration 70, loss = 0.19893609
Iteration 71, loss = 0.19856752
Iteration 72, loss = 0.19520744
Iteration 73, loss = 0.19711629
Iteration 74, loss = 0.20224546
Iteration 75, loss = 0.19738338
Iteration 76, loss = 0.19452250
Iteration 77, loss = 0.19618929
Iteration 78, loss = 0.19542535
Iteration 79, loss = 0.19477962
Iteration 80, loss = 0.19301399
Iteration 81, loss = 0.19336297
Iteration 82, loss = 0.19241697
Iteration 83, loss = 0.19238705
Iteration 84, loss = 0.19896561
Iteration 85, loss = 0.19384432
Iteration 86, loss = 0.19112966
Iteration 87, loss = 0.19233425
Iteration 88, loss = 0.19251401
Iteration 89, loss = 0.19105247
Iteration 90, loss = 0.19009781
Iteration 91, loss = 0.18778175
Iteration 92, loss = 0.18922685
Iteration 93, loss = 0.18841970
Iteration 94, loss = 0.19046763
Iteration 95, loss = 0.18784575
Iteration 96, loss = 0.18821714
Iteration 97, loss = 0.18545463
Iteration 98, loss = 0.18877240
Iteration 99, loss = 0.18624763
Iteration 100, loss = 0.18780378
Iteration 101, loss = 0.18318539
Iteration 102, loss = 0.18432698
Iteration 103, loss = 0.18430055
Iteration 104, loss = 0.18373412
Iteration 105, loss = 0.18163919
Iteration 106, loss = 0.18406481
Iteration 107, loss = 0.18180152
Iteration 108, loss = 0.18445303
Iteration 109, loss = 0.18268441
Iteration 110, loss = 0.18664655
Iteration 111, loss = 0.18374929
Iteration 112, loss = 0.18563278
Iteration 113, loss = 0.18450609
Iteration 114, loss = 0.18402619
Iteration 115, loss = 0.18236231
Iteration 116, loss = 0.18302651
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Neural Network:
	Accuracy: 92.44
	Precision: 88.31
	Recall: 93.17
	F1-Score: 90.68

Communicating the Results ¶

After applying data preprocessing, feature engineering and selection, training and testing of the spambase dataset, both individually and combined through the evaluation process, the model with the highest precision scores are Adaboost and random forest with an accuracy of 94.61 and 94.53 respectively. Other models performed well like the ANN with a score of 92.44. A decrease in the learning rate of the ANN to 0.01 showed a slight improvement to 92.96 but it took a longer time to execute. SVM had the least accuracy score of 89.04. It is recommended that Adaboost should be used when building a model to classify emails as Spam or Ham as its accuracy is higher than other classifiers.